James et al (2021): An Introduction to Statistical Learning. Chapter 10.
All the material presented on these module slides and in class.
Videos on neural networks and backpropagation.
\(~\)
Secondary material (not compulsory):
\(~\)
See also References and further reading (last slide), for further reading material.
\(~\)
Deep learning: The timeline
Single and multilayer neural networks
Convolutional neural networks
Recurrent neural networks
Interpolation and double descent
\(~\)
Neural networks (NNs) first rose to prominence in the late 1980s and early 1990s.
The field shifted from statistics to computer science and machine learning, as NNs are highly parameterized models.
Statisticians were skeptical: ``It’s just a nonlinear model’’.
After the first hype, NNs were pushed aside by boosting and support vector machines.
Revival since 2010: deep learning emerged thanks to improved computational resources, methodological innovations, and successful applications to image and video classification as well as speech and text processing.
Deep Learning is an algorithm which has no theoretical limitations of what it can learn; the more data you give and the more computational time you provide, the better it is.
Geoffrey Hinton (Google)
(based on Chollet and Allaire (2018))
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
So we first need to understand: what is a neural network?
Neuron and myelinated axon, with signal flow from inputs at dendrites to outputs at axon terminals. Image credits: By Egm4313.s12 (Prof. Loc Vu-Quoc) https://commons.wikimedia.org/w/index.php?curid=72816083
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
According to Chollet and Allaire (2018) (page 19):
Machine learning isn’t mathematics or physics, where major advancements can be done with a pen and a piece of paper. It’s an engineering science.
\(~\)
\(~\)
\(~\)
Recapitulate from Module 3: the bodyfat dataset contains the following variables.
bodyfat: % of body fat
age: age of the person
weight: body weight
height: body height
neck: neck thickness
bmi: body mass index
abdomen: circumference of abdomen
hip: circumference of hip
We will now model bodyfat as the response and use all other variables as covariates; this gives us \(p=7\) predictors.
Let \(n\) be the number of observations in the training set, here \(n=243\).
(from Module 3)
\(~\)
We assume \[ Y_i=\beta_0 + \beta_1 x_{i1}+\beta_2 x_{i2}+\cdots + \beta_p x_{ip}+\varepsilon_i={\boldsymbol x}_i^T{\boldsymbol\beta}+\varepsilon_i \ , \]
for \(i=1,\ldots,n\), where \(x_{ij}\) is the value of the \(j\)th predictor for the \(i\)th datapoint, and \({\boldsymbol\beta}^\top = (\beta_0,\beta_1,\ldots,\beta_p)\) are the regression coefficients.
\(~\)
We used the compact matrix notation for all observations \(i=1,\ldots,n\) together: \[{\boldsymbol Y}={\boldsymbol {X}} \boldsymbol{\beta}+{\boldsymbol{\varepsilon}} \ .\]
Assumptions:
The classical normal linear regression model is obtained if additionally \(\varepsilon_i \overset{iid}{\sim} \mathcal{N}(0,\sigma^2)\).
\(~\)
How can our statistical model be represented as a network?
\(~\)
We need new concepts:
\(~\)
\(~\)
\(~\)
\(~\)
## # weights: 8
## initial value 498619.660230
## iter 10 value 4521.462919
## final value 4415.453729
## converged
\(~\)
\[\begin{equation*} Y_i=\beta_0 + \beta_1 x_{i1}+\beta_2 x_{i2}+\cdots + \beta_p x_{ip}+\varepsilon_i \ . \end{equation*}\]
\(~\)
\(~\)
In the statistics world
we would have written \(\hat{y}_1({\boldsymbol x}_i)\) to specify that we are estimating a predicted value of the response for the given covariate value.
we would have called the \(w\)s \(\hat{\beta}\)s instead.
\(~\)
Remember: The estimator \(\hat{\boldsymbol \beta}\) is found by minimizing the RSS for a multiple linear regression model: \[ \begin{aligned} \text{RSS} &=\sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_{i1} - \hat \beta_2 x_{i2} -\cdots-\hat \beta_p x_{ip} )^2 \\ &= \sum_{i=1}^n (y_i-{\boldsymbol x}_i^T \hat{\boldsymbol \beta})^2=({\boldsymbol Y}-{\boldsymbol X}\hat{\boldsymbol{\beta}})^T({\boldsymbol Y}-{\boldsymbol X}\hat{\boldsymbol{\beta}}) \ .\end{aligned} \] Solution: \[ \hat{\boldsymbol\beta}=({\boldsymbol X}^T{\boldsymbol X})^{-1} {\boldsymbol X}^T {\boldsymbol Y} \ .\]
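The closed-form solution can be checked directly in R; a minimal sketch on simulated data (all numbers here are arbitrary choices for illustration):

```r
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))   # design matrix with intercept column
beta <- c(2, 0.5, -1, 3)                    # true coefficients (made up)
Y <- as.vector(X %*% beta + rnorm(n))

# Normal equations: beta-hat = (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# Same result as lm() without its own intercept (X already contains one)
cbind(beta_hat, coef(lm(Y ~ X - 1)))
```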
We now translate from the statistical into the neural networks world:
\(~\)
\(~\)
\(~\)
\(~\)
(https://github.com/SoojungHong/MachineLearning/wiki/Gradient-Descent)
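The idea behind the linked gradient-descent illustration can be sketched in a few lines of R for a simple linear model (learning rate, iteration count and the simulated data are arbitrary choices for illustration):

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)     # true intercept 1, true slope 2
X <- cbind(1, x)              # design matrix with intercept column

w <- c(0, 0)                  # initial weights
eta <- 0.1                    # learning rate (step length)
for (iter in 1:500) {
  grad <- -2 * t(X) %*% (y - X %*% w) / n   # gradient of the mean squared error
  w <- w - eta * grad                       # step in the negative gradient direction
}
w                             # close to the least-squares solution
</imports>```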
\(~\)
Here we compare
\(~\)
Linear regression vs. neural networks: an example.
\(~\)
fit = lm(bodyfat ~ age + weight + height + bmi + neck + abdomen + hip,
data = d.bodyfat)
fitnnet = nnet(bodyfat ~ age + weight + height + bmi + neck + abdomen +
hip, data = d.bodyfat, linout = TRUE, size = 0, skip = TRUE, maxit = 1000,
entropy = FALSE)
## # weights: 8
## initial value 114935.463194
## iter 10 value 4471.423775
## final value 4415.453729
## converged
cbind(fitnnet$wts, fit$coefficients)
## [,1] [,2]
## (Intercept) -9.748907e+01 -9.748903e+01
## age -9.607663e-04 -9.607669e-04
## weight -6.292823e-01 -6.292820e-01
## height 3.974886e-01 3.974884e-01
## bmi 1.785331e+00 1.785330e+00
## neck -4.945725e-01 -4.945725e-01
## abdomen 8.945189e-01 8.945189e-01
## hip -1.255549e-01 -1.255549e-01
The aim is to predict whether a person has diabetes. The data stem from a population of women of Pima Indian heritage in the US, available in the R MASS package. The following information is available for each woman:
\(~\)
diabetes: 0 = not present, 1 = present
npreg: number of pregnancies
glu: plasma glucose concentration in an oral glucose tolerance test
bp: diastolic blood pressure (mmHg)
skin: triceps skin fold thickness (mm)
bmi: body mass index (weight in kg/(height in m)\(^2\))
ped: diabetes pedigree function
age: age in years
\(~\)
\(~\)
\(~\)
(Maximum likelihood)
\[ \ln(L(\boldsymbol{\beta}))=l(\boldsymbol{\beta}) =\sum_{i=1}^n \Big ( y_i \ln p_i + (1-y_i) \ln(1 - p_i )\Big ) \ .\]
\(~\)
\(~\)
\[ \hat{y}_1({\boldsymbol x}_i)= \sigma({\boldsymbol x}_i) = \frac{1}{1+\exp(-(w_0+w_1 x_{i1}+\cdots + w_r x_{ir}))} \in (0,1) \ . \]
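A minimal R sketch of these two ingredients, the sigmoid output and the cross-entropy (negative log-likelihood) loss; the functions and the numbers are illustrative, not part of the course code:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Predicted probability for covariate vector x and weights w = (w0, w1, ..., wr)
predict_prob <- function(w, x) sigmoid(w[1] + sum(w[-1] * x))

# Cross-entropy loss = negative log-likelihood, summed over observations
cross_entropy <- function(y, p) -sum(y * log(p) + (1 - y) * log(1 - p))

p <- predict_prob(c(-1, 2), 0.5)   # sigmoid(-1 + 2 * 0.5) = sigmoid(0) = 0.5
cross_entropy(1, p)                # -log(0.5), about 0.693
```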
\(~\)
\(~\)
\(~\)
fitlogist = glm(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
data = train, family = binomial(link = "logit"))
summary(fitlogist)
##
## Call:
## glm(formula = diabetes ~ npreg + glu + bp + skin + bmi + ped +
## age, family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9830 -0.6773 -0.3681 0.6439 2.3154
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.773062 1.770386 -5.520 3.38e-08 ***
## npreg 0.103183 0.064694 1.595 0.11073
## glu 0.032117 0.006787 4.732 2.22e-06 ***
## bp -0.004768 0.018541 -0.257 0.79707
## skin -0.001917 0.022500 -0.085 0.93211
## bmi 0.083624 0.042827 1.953 0.05087 .
## ped 1.820410 0.665514 2.735 0.00623 **
## age 0.041184 0.022091 1.864 0.06228 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 256.41 on 199 degrees of freedom
## Residual deviance: 178.39 on 192 degrees of freedom
## AIC: 194.39
##
## Number of Fisher Scoring iterations: 5
set.seed(787879)
library(nnet)
fitnnet = nnet(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
data = train, linout = FALSE, size = 0, skip = TRUE, maxit = 1000,
entropy = TRUE, Wts = fitlogist$coefficients + rnorm(8, 0, 0.1))
## # weights: 8
## initial value 213.575955
## iter 10 value 89.511044
## final value 89.195333
## converged
# entropy=TRUE because default is least squares
cbind(fitnnet$wts, fitlogist$coefficients)
## [,1] [,2]
## (Intercept) -9.773046277 -9.773061533
## npreg 0.103183171 0.103183427
## glu 0.032116832 0.032116823
## bp -0.004767678 -0.004767542
## skin -0.001917105 -0.001916632
## bmi 0.083624151 0.083623912
## ped 1.820397792 1.820410367
## age 0.041183744 0.041183529
\(~\)
By setting entropy=TRUE we minimize the cross-entropy
loss.
plotnet(fitnnet)
But, there may also exist local minima.
\(~\)
set.seed(123)
fitnnet = nnet(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
data = train, linout = FALSE, size = 0, skip = TRUE, maxit = 10000,
entropy = TRUE, Wts = fitlogist$coefficients + rnorm(8, 0, 1))
## # weights: 8
## initial value 24315.298582
## final value 12526.062906
## converged
cbind(fitnnet$wts, fitlogist$coefficients)
## [,1] [,2]
## (Intercept) -36.733537 -9.773061533
## npreg -77.126994 0.103183427
## glu -2984.409175 0.032116823
## bp -1835.934259 -0.004767542
## skin -718.072629 -0.001916632
## bmi -818.561311 0.083623912
## ped -8.687473 1.820410367
## age -773.023878 0.041183529
Why can NN and logistic regression lead to such different results?
\(~\)
The iris flower data set was introduced by the British
statistician and biologist Ronald Fisher in 1936.
\(~\)
The four covariates are Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.
\(~\)
The aim is to predict the species of an iris plant.
\(~\)
We only briefly mentioned multiclass regression in Module 4.
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
First select a training sample
\(~\)
library(nnet)
set.seed(123)
train = sample(1:150, 50)
iris_train = ird[train, ]
iris_test = ird[-train, ]
\(~\)
Then fit the model with nnet() (for a factor response with more than two levels, the softmax activation function is used by default):
set.seed(1234)
iris.nnet <- nnet(species ~ ., data = ird, subset = train, size = 0,
skip = TRUE, maxit = 100)
## # weights: 15
## initial value 105.595764
## iter 10 value 1.050064
## iter 20 value 0.018814
## iter 30 value 0.003937
## iter 40 value 0.002062
## iter 50 value 0.001460
## iter 60 value 0.000150
## iter 70 value 0.000125
## iter 80 value 0.000110
## final value 0.000096
## converged
How many weights (parameters) have been estimated?
What does the graph look like?
\(~\)
summary(iris.nnet)
## a 4-0-3 network with 15 weights
## options were - skip-layer connections softmax modelling
## b->o1 i1->o1 i2->o1 i3->o1 i4->o1
## 36.79 9.69 1.26 -14.33 -13.24
## b->o2 i1->o2 i2->o2 i3->o2 i4->o2
## 5.16 10.10 21.33 -25.47 -12.48
## b->o3 i1->o3 i2->o3 i3->o3 i4->o3
## -42.03 -20.25 -23.12 40.80 25.96
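The softmax activation in the output layer turns the linear predictors of the three output nodes into class probabilities. A minimal sketch with hypothetical linear-predictor values (not the fitted weights above):

```r
# Numerically stable softmax: subtract the maximum before exponentiating
softmax <- function(z) exp(z - max(z)) / sum(exp(z - max(z)))

z <- c(2.0, 0.5, -1.0)   # hypothetical linear predictors, one per output node
p <- softmax(z)
p                        # class probabilities
sum(p)                   # 1
```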
Multinomial regression is also done with nnet, but via the wrapper multinom() (we use default settings, so the results are not necessarily the same as above).
\(~\)
library(caret)
fit = multinom(species ~ ., family = multinomial, data = iris_train,
trace = FALSE)
coef(fit)
## (Intercept) Sepal.L. Sepal.W. Petal.L. Petal.W.
## s -15.67359 -8.679545 42.40492 -25.94868 -7.132379
## v -157.32804 -19.694488 -17.03445 46.05297 59.135341
fit$wts
## [1] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [7] 0.000000 -15.673595 -8.679545 42.404924 -25.948681 -7.132379
## [13] 0.000000 -157.328044 -19.694488 -17.034447 46.052968 59.135341
\(~\)
\(~\)
\(~\)
\(~\) But:
These are only linear models (linear boundaries).
Parameters (weights) found using gradient descent algorithms where the learning rate (step length) must be set.
Connections are only forward in the network; there are no feedback connections that send the output of the model back into the network.
Examples: Linear, logistic and multinomial regression with or without any hidden layers (between the input and output layers).
We may have between zero and very many hidden layers.
Adding hidden layers with non-linear activation functions between the input and output layer will make nonlinear statistical models.
The number of hidden layers is called the depth of the network, and the number of nodes in a layer is called the width of the layer.
\(~\)
\(~\)
\(~\)
The universal approximation theorem says that a feedforward network with a single hidden layer containing sufficiently many units (and a suitable activation function)
can approximate any (Borel measurable) function from one finite-dimensional space (our input layer) to another (our output layer) with any desired non-zero amount of error.
In particular, the universal approximation theorem holds for sigmoid activation functions
in the hidden layer.
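As an illustration of the theorem (not a proof), a single hidden layer of sigmoid units can approximate a smooth nonlinear function; a sketch using nnet, where the target function, the number of hidden units and maxit are arbitrary choices:

```r
library(nnet)
set.seed(1)
df <- data.frame(x = seq(-3, 3, length.out = 200))
df$y <- sin(2 * df$x)                     # nonlinear target function

# One hidden layer with 10 sigmoid units and a linear output unit
fit <- nnet(y ~ x, data = df, size = 10, linout = TRUE, maxit = 2000, trace = FALSE)

mean((predict(fit, df) - df$y)^2)         # mean squared training error
```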
The nnet and keras R packages
\(~\)
We will use both the rather simple nnet R package by
Brian Ripley and the currently very popular keras package
for deep learning (the keras package will be presented
later).
nnet fits networks with at most one hidden layer and sigmoid activation function. The implementation does not use gradient descent, but instead BFGS via optim.
Type ?nnet into your R console to see the arguments of nnet().
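The same idea, minimizing a loss with optim and method "BFGS", can be sketched for a plain linear model (simulated data, illustrative only):

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# The loss (RSS) as a function of the weight vector w = (intercept, slope)
rss <- function(w) sum((y - w[1] - w[2] * x)^2)

# BFGS is a quasi-Newton method; optim approximates the gradient numerically here
fit <- optim(c(0, 0), rss, method = "BFGS")
rbind(optim = fit$par, lm = coef(lm(y ~ x)))   # both approaches nearly agree
```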
If the response in formula is a factor, an appropriate classification network is constructed; this has one output, sigmoid activation and binary entropy loss for a binary response, and a number of outputs equal to the number of classes, softmax activation and categorical cross-entropy loss for more levels.
Objective: To predict the median price of owner-occupied homes in a given Boston suburb in the mid-1970s using 10 input variables.
This data set is both available in the MASS and
keras R package.
\(~\)